13 research outputs found

    Domain ontology learning from the web

    Get PDF
    Ontology Learning is defined as the set of methods used for building an ontology from scratch, or for enriching or adapting an existing one, in a semi-automatic fashion using heterogeneous information sources. This data-driven procedure uses text, electronic dictionaries, linguistic ontologies and structured and semi-structured information to acquire knowledge. Recently, with the enormous growth of the Information Society, the Web has become a valuable source of information for almost every possible domain of knowledge. This has motivated researchers to start considering the Web as a valid repository for Information Retrieval and Knowledge Acquisition. However, the Web suffers from problems that are not typically observed in classical information repositories: human-oriented presentation, noise, untrusted sources, high dynamicity and overwhelming size. Even so, it also presents characteristics that are interesting for knowledge acquisition: owing to its huge size and heterogeneity, the Web is assumed to approximate the real distribution of information at a global scale. The present work introduces a novel approach for ontology learning, with new methods for acquiring knowledge from the Web. The proposal is distinguished from previous work mainly by the particular adaptation of several well-known learning techniques to the Web corpus, and by the exploitation of characteristics of the Web environment to compose an automatic, unsupervised and domain-independent approach. With respect to the ontology building process, the following methods have been developed: i) extraction and selection of domain-related terms, organising them taxonomically; ii) discovery and labelling of non-taxonomic relationships between concepts; iii) additional methods for improving the final structure, including the detection of named entities, class attributes, multiple inheritance and a certain degree of semantic disambiguation. The full learning methodology has been implemented as a distributed, agent-based system, providing a scalable solution. It has been evaluated on several clearly distinct domains of knowledge, obtaining good-quality results. Finally, several direct applications have been developed, including the automatic structuring of digital libraries and web resources, and ontology-based Web Information Retrieval.
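    As a purely illustrative aid to the taxonomy-building step described above (extracting domain terms from web text and organising them taxonomically), the following Python sketch applies one Hearst-style lexico-syntactic pattern ("X such as Y, Z and W") to raw text snippets to propose candidate hypernym/hyponym pairs. The pattern, the single-word term restriction and the example snippets are simplifications made up for this sketch; they do not reproduce the thesis' actual system.

        import re
        from collections import defaultdict

        # Illustrative sketch only: propose candidate hypernym/hyponym pairs from
        # text snippets with one Hearst-style pattern,
        # "<hypernym> such as <hyponym>[, <hyponym>]* [and|or <hyponym>]".
        # Terms are restricted to single words to keep the pattern simple.
        PATTERN = re.compile(
            r"(\w+) such as (\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)",
            re.IGNORECASE,
        )

        def extract_pairs(snippets):
            """Map each candidate hypernym to the set of hyponyms found in the snippets."""
            taxonomy = defaultdict(set)
            for text in snippets:
                for hyper, hypo_list in PATTERN.findall(text):
                    for hypo in re.split(r",\s*|\s+(?:and|or)\s+", hypo_list):
                        taxonomy[hyper.lower()].add(hypo.lower())
            return taxonomy

        if __name__ == "__main__":
            snippets = [
                "Chronic diseases such as diabetes, hypertension and asthma require care.",
                "Search engines such as Google index billions of web pages.",
            ]
            for hyper, hypos in sorted(extract_pairs(snippets).items()):
                print(hyper, "->", sorted(hypos))

    Running the sketch prints "diseases -> ['asthma', 'diabetes', 'hypertension']" and "engines -> ['google']"; a real web-scale learner would, among many other things, query a search engine for such snippets and statistically filter the candidate pairs.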

    Ontology-driven web-based semantic similarity

    Get PDF
    The version of record is available online at: http://dx.doi.org/10.1007/s10844-009-0103-x
    Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distributions from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, they heavily depend on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of the presented issues are analyzed and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing, to achieve reliable results and to minimize language ambiguity. Our proposals are able to outperform classical approaches when the Web is used to estimate concept probabilities. Peer reviewed. Postprint (author's final draft).
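    For reference, the classical IC-based notions the abstract builds on can be written as follows, where p(c) is the probability of observing concept c in a corpus and LCS denotes the least common subsumer of two concepts in the taxonomy. The last line is only a commonly used hit-count approximation of concept probability on the Web; the contextualized, taxonomy-aware formulation is the paper's own contribution and is not reproduced here.

        IC(c) = -\log p(c)
        sim_{Resnik}(c_1, c_2) = IC\big(\mathrm{LCS}(c_1, c_2)\big)
        sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(\mathrm{LCS}(c_1, c_2))}{IC(c_1) + IC(c_2)}
        p_{web}(c) \approx \frac{\mathrm{hits}(c)}{\mathrm{total\ indexed\ pages}}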

    Semantic similarity estimation from multiple ontologies

    Get PDF
    The version of record is available online at: http://dx.doi.org/10.1007/s10489-012-0355-y
    Peer reviewed. Postprint (author's final draft).

    Knowledge-driven delivery of home care services

    Get PDF
    The version of record is available online at: http://dx.doi.org/10.1007/s10844-010-0145-0
    Home Care (HC) assistance is emerging as an effective and efficient alternative to institutionalized care, especially for senior patients who present multiple co-morbidities and require lifelong treatments under continuous supervision. The care of such patients requires the definition of specially tailored treatments, and their delivery involves the coordination of a team of professionals from different institutions, requiring the management of many kinds of knowledge (medical, organizational, social and procedural). The K4Care project aims to assist the HC of elderly patients by proposing a standard HC model and implementing it in a knowledge-driven e-health platform that supports the provision of HC services. Peer reviewed. Postprint (author's final draft).

    Toward sensitive document release with privacy guarantees

    No full text
    DOI: 10.1016/j.engappai.2016.12.013 URL: http://www.sciencedirect.com/science/article/pii/S0952197616302408
    Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are daily exchanged or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, many of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes, etc.) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the theoretical basis of C-sanitization, an inherently semantic privacy model that provides the basis for the development of automatic document redaction/sanitization algorithms and offers clear a priori privacy guarantees on data protection; despite its potential benefits, C-sanitization still presents some limitations when applied in practice (mainly regarding flexibility, efficiency and accuracy). In this paper, we propose a new, more flexible model, named (C, g(C))-sanitization, which enables an intuitive configuration of the trade-off between the desired level of protection (i.e., controlled information disclosure) and the preservation of the utility of the protected data (i.e., the amount of semantics to be preserved). Moreover, we also present a set of technical solutions and algorithms that provide an efficient and scalable implementation of the model and improve its practical accuracy, as we also illustrate through empirical experiments.
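    As a rough illustration of the redaction-by-generalization step the abstract mentions (detect sensitive terms, then replace them with broader concepts), the Python sketch below substitutes flagged terms by a hypernym taken from a tiny hand-made taxonomy. The taxonomy, the sensitive-term list and the fallback token are invented for the example; this is not the (C, g(C))-sanitization model itself, which derives what to protect and how much to generalize from a priori privacy guarantees.

        import re

        # Illustrative only: replace each detected sensitive term with a broader
        # concept (its hypernym). The toy taxonomy and the sensitive-term list are
        # invented for the example; real systems derive them from knowledge bases.
        HYPERNYMS = {
            "aids": "infectious disease",
            "chemotherapy": "medical treatment",
            "barcelona": "city",
        }

        def sanitize(text, sensitive_terms, hypernyms):
            """Replace sensitive terms by their hypernym (or a redaction mark if none is known)."""
            for term in sensitive_terms:
                replacement = hypernyms.get(term.lower(), "[REDACTED]")
                text = re.sub(re.escape(term), replacement, text, flags=re.IGNORECASE)
            return text

        if __name__ == "__main__":
            doc = "The patient, treated in Barcelona, received chemotherapy after an AIDS diagnosis."
            print(sanitize(doc, ["AIDS", "chemotherapy", "Barcelona"], HYPERNYMS))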

    Privacy-preserving data outsourcing in the cloud via semantic data splitting

    No full text
    DOI: 10.1016/j.comcom.2017.06.012
    Even though cloud computing provides many intrinsic benefits (e.g., cost savings, availability, scalability, etc.), privacy concerns related to the lack of control over the storage and management of the outsourced (confidential) data still prevent many customers from migrating to the cloud. In this respect, several privacy-protection mechanisms based on a prior encryption of the data to be outsourced have been proposed. Data encryption offers robust security, but at the cost of hampering the efficiency of the service and limiting the functionalities that can be applied over the (encrypted) data stored on cloud premises. Because both efficiency and functionality are crucial advantages of cloud computing, especially in SaaS, in this paper we aim at retaining them by proposing a privacy-protection mechanism that relies on splitting (clear) data and on the distributed storage offered by the increasingly popular notion of multi-clouds. Specifically, we propose a semantically grounded data-splitting mechanism that is able to automatically detect pieces of data that may cause privacy risks and split them on local premises, so that each chunk does not incur those risks; the chunks of clear data are then independently stored in the separate locations of a multi-cloud, so that external entities (cloud service providers and attackers) cannot access the confidential data as a whole. Because partial data are stored in the clear on cloud premises, outsourced functionalities are seamlessly and efficiently supported by simply broadcasting queries to the different cloud locations. To enforce a robust privacy notion, our proposal relies on a privacy model that offers a priori privacy guarantees; to ensure its feasibility, we have designed heuristic algorithms that minimize the number of cloud storage locations needed; and to show its potential and generality, we have applied it to the least structured and most challenging data type: plain textual documents.
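    A minimal sketch of the multi-cloud splitting idea described above: chunks that contain risky terms are sent to different storage locations so that no single location receives all the co-occurring sensitive pieces, and keyword queries are broadcast to every location. The keyword-based risk detector, the sentence-level chunking and the round-robin assignment are simplifications invented for this example; the paper's semantic risk analysis and location-minimizing heuristics are not reproduced here.

        import re
        from itertools import cycle

        # Illustrative only: split a document into sentence-level chunks, send chunks
        # that contain risky terms to different storage locations (round-robin), and
        # answer keyword queries by "broadcasting" them to every location.
        RISKY_TERMS = {"diagnosis", "salary", "passport"}   # invented for the example

        def split_document(text, locations):
            storage = {loc: [] for loc in locations}
            risky_targets = cycle(locations)   # spread risky chunks across clouds
            safe_target = locations[0]         # non-risky chunks may stay together
            for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
                words = set(re.findall(r"\w+", chunk.lower()))
                target = next(risky_targets) if words & RISKY_TERMS else safe_target
                storage[target].append(chunk)
            return storage

        def query(storage, keyword):
            """Broadcast a keyword query to all locations and merge the results."""
            return [c for chunks in storage.values() for c in chunks if keyword.lower() in c.lower()]

        if __name__ == "__main__":
            doc = ("The employee works in Tarragona. Her salary is 40000 euros. "
                   "Her passport number was verified. She enjoys hiking.")
            clouds = split_document(doc, ["cloud_a", "cloud_b", "cloud_c"])
            for loc, chunks in clouds.items():
                print(loc, chunks)
            print("query 'passport':", query(clouds, "passport"))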

    Semantic noise: Privacy-protection of nominal microdata through uncorrelated noise addition

    No full text
    Peer reviewed. Personal data are of great interest in statistical studies and for providing personalized services, but their release may impair the privacy of individuals. To protect privacy, in this paper we present the notion and practical enforcement of semantic noise, a semantically grounded version of the numerical uncorrelated noise addition method, which is capable of masking textual data while properly preserving their semantics. Unlike other perturbative masking schemes, our method can work both with datasets containing information on several individuals and with individual data items. Empirical results show that our proposal provides semantically coherent outcomes that preserve data utility better than non-semantic perturbative mechanisms.
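    The sketch below is a toy reading of the idea of transplanting uncorrelated noise addition to nominal values: a numeric noise magnitude is drawn, and the original category is replaced by another concept whose taxonomy-based distance is closest to that magnitude. The two-level toy taxonomy, the edge-counting distance and the Gaussian noise scale are assumptions made up for the example; the paper's actual semantic distance and noise calibration are not reproduced here.

        import random

        # Illustrative only: "semantic noise" over a nominal attribute. A toy taxonomy
        # is encoded as concept -> parent; the semantic distance is the number of edges
        # between two concepts; the replacement is the concept whose distance to the
        # original best matches a randomly drawn noise magnitude.
        PARENT = {  # invented two-level taxonomy
            "flu": "respiratory disease", "asthma": "respiratory disease",
            "diabetes": "metabolic disease", "obesity": "metabolic disease",
            "respiratory disease": "disease", "metabolic disease": "disease",
            "disease": None,
        }

        def path_to_root(c):
            path = [c]
            while PARENT[c] is not None:
                c = PARENT[c]
                path.append(c)
            return path

        def distance(a, b):
            pa, pb = path_to_root(a), path_to_root(b)
            common = next(x for x in pa if x in pb)        # lowest common ancestor
            return pa.index(common) + pb.index(common)     # edge-counting distance

        def add_semantic_noise(value, scale=1.5, rng=random):
            noise = abs(rng.gauss(0.0, scale))             # magnitude of the perturbation
            candidates = [c for c in PARENT if c != value]
            return min(candidates, key=lambda c: abs(distance(value, c) - noise))

        if __name__ == "__main__":
            random.seed(0)
            print(add_semantic_noise("flu"))   # e.g. a sibling or a more distant concept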